UMAP (Uniform Manifold Approximation and Projection)#

UMAP is a dimensionality reduction algorithm that builds a low-dimensional map (usually 2D or 3D) in which points that are nearby in high dimensions tend to stay nearby. It’s commonly used for:

  • visualization (like t-SNE)

  • building features for clustering / downstream models

  • fast, scalable “manifold-ish” embeddings

This notebook keeps the style consistent with the supervised notebooks: intuition first, math second, with Plotly experiments.


Learning goals#

By the end you should be able to:

  • explain UMAP with two mental models: “Stretching a rubber sheet” and “Local neighborhoods with global awareness”

  • describe the math pipeline: manifold assumption → fuzzy simplicial set → graph → cross-entropy optimization

  • tune the key hyperparameters: n_neighbors, min_dist, and metric

  • compare UMAP vs t-SNE vs Isomap (strengths, weaknesses, when to use what)

Prerequisites (light)#

  • nearest neighbors / distance metrics

  • graphs (weighted edges) + the idea of optimizing a loss

Notation#

  • Data: \(X \in \mathbb{R}^{n\times d}\) (rows are samples)

  • High-dim distance: \(d(x_i, x_j)\)

  • Low-dim embedding: \(Y \in \mathbb{R}^{n\times m}\) (usually \(m=2\))

  • Neighborhood size: \(k\) (UMAP’s n_neighbors)


Table of contents#

  1. Intuition (rubber sheet + neighborhoods)

  2. Mathematical foundation

  3. Hyperparameters explained

  4. Plotly experiments

  5. Practical guidance

  6. Comparison table (UMAP vs t-SNE vs Isomap)

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from plotly.subplots import make_subplots

from sklearn.datasets import load_digits, make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap, TSNE, trustworthiness
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr


pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

try:
    from umap import UMAP
except ModuleNotFoundError:
    UMAP = None
    print("Optional dependency missing: `umap-learn` (UMAP).")
    print("Install with: pip install umap-learn")
Optional dependency missing: `umap-learn` (UMAP).
Install with: pip install umap-learn

1) Intuition: “Stretching a rubber sheet”#

UMAP is easiest to remember as a two-part story:

Metaphor A: “Stretching a rubber sheet”#

Imagine the data lie on a wrinkled surface embedded in a high-dimensional space. If you could grab that surface and carefully flatten it onto a table, you’d want to:

  • keep nearby points nearby (don’t tear local patches)

  • allow the sheet to stretch where needed (some distortion is inevitable)

That’s the manifold idea: the data are high-dimensional, but the degrees of freedom are lower.

Metaphor B: “Local neighborhoods with global awareness”#

UMAP starts by asking, for each point:

“Who are your closest friends (neighbors), and how strongly do you belong together?”

Then it places points in 2D so that all those local friendships can be satisfied at once. That second step is where the “global awareness” comes from: the global layout emerges as a compromise between many local constraints.


A classic manifold toy problem is the swiss roll: it’s 3D data that’s really a 2D sheet rolled up.

# A classic "manifold" toy dataset: the swiss roll
X_swiss, t_swiss = make_swiss_roll(n_samples=1400, noise=0.06, random_state=42)
t_swiss = t_swiss.astype(np.float64)

fig = make_subplots(
    rows=1,
    cols=2,
    specs=[[{"type": "scene"}, {"type": "xy"}]],
    subplot_titles=("Observed data in 3D", "Intrinsic coordinates (t vs height)"),
)

fig.add_trace(
    go.Scatter3d(
        x=X_swiss[:, 0],
        y=X_swiss[:, 1],
        z=X_swiss[:, 2],
        mode="markers",
        marker=dict(
            size=3,
            color=t_swiss,
            colorscale="Viridis",
            opacity=0.85,
            colorbar=dict(title="t"),
        ),
        showlegend=False,
    ),
    row=1,
    col=1,
)

# A "ground truth" 2D view: swiss roll is basically (t, height)
fig.add_trace(
    go.Scatter(
        x=t_swiss,
        y=X_swiss[:, 1],
        mode="markers",
        marker=dict(size=3, color=t_swiss, colorscale="Viridis", opacity=0.85, showscale=False),
        showlegend=False,
    ),
    row=1,
    col=2,
)

fig.update_layout(title="Manifold intuition: the swiss roll is 2D in disguise", height=460)
fig.update_scenes(xaxis_title="x", yaxis_title="y (height)", zaxis_title="z")
fig.update_xaxes(title_text="t (roll position)", row=1, col=2)
fig.update_yaxes(title_text="y (height)", row=1, col=2)
fig.show()

2) Mathematical foundation (intuition → math)#

UMAP’s full paper is deep, but the “working mental model” can be built from four pieces.

2.1 Manifold assumption#

Intuition:

  • The data live on (or near) a low-dimensional manifold.

  • Locally, the manifold looks approximately flat, so nearest neighbors are meaningful.

  • A good embedding stitches many local patches together.

2.2 Fuzzy simplicial sets#

Intuition:

Instead of saying “edge or no edge” between points, UMAP uses fuzzy membership strengths in \([0,1]\).

  • “1” means “very confidently connected”

  • “0” means “not connected”

Math (core idea):

UMAP builds a \(k\)-nearest-neighbor graph and assigns a directed membership strength for each neighbor \(j\) of point \(i\):

\[\mu_{j\mid i} = \exp\left(-\frac{\max(0,\; d(x_i, x_j) - \rho_i)}{\sigma_i}\right)\]

where:

  • \(\rho_i\) is the distance from \(x_i\) to its nearest neighbor; this offset guarantees local connectivity (every point gets at least one full-strength edge)

  • \(\sigma_i\) is a local scale, found by binary search, so that every point’s neighborhood has a comparable total “mass”

Then it symmetrizes the directed memberships with a fuzzy-OR (probabilistic sum):

\[\mu_{ij} = \mu_{j\mid i} + \mu_{i\mid j} - \mu_{j\mid i}\,\mu_{i\mid j}\]
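As a quick numeric check of the fuzzy-OR: if either direction is confident, the symmetrized edge stays strong. This tiny helper is just for illustration (it is not part of UMAP's API):

```python
def fuzzy_or(m_ij: float, m_ji: float) -> float:
    """Probabilistic sum of two directed membership strengths."""
    return m_ij + m_ji - m_ij * m_ji

# One confident direction keeps the symmetrized edge strong...
fuzzy_or(0.9, 0.2)  # ≈ 0.92
# ...and the result never exceeds 1 or falls below the larger input.
fuzzy_or(1.0, 0.3)  # ≈ 1.0
```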

2.3 Graph construction#

After symmetrization, you can think of UMAP as building a weighted graph whose edge weights are \(\mu_{ij}\). That graph is the data’s “fuzzy topology” in a form we can optimize.

2.4 Cross-entropy optimization#

Intuition:

We want a low-dimensional layout where:

  • edges with high \(\mu_{ij}\) become close

  • pairs with low \(\mu_{ij}\) are encouraged to be far

Math (core objective):

UMAP defines a low-dimensional similarity (a smooth “attraction curve”), often written as:

\[\nu_{ij} = \frac{1}{1 + a\,\lVert y_i - y_j \rVert^{2b}}\]

and minimizes a cross-entropy between \(\mu\) (high-dim graph) and \(\nu\) (low-dim layout):

\[\mathcal{L} = -\sum_{i<j} \Big[\mu_{ij}\log(\nu_{ij}) + (1-\mu_{ij})\log(1-\nu_{ij})\Big]\]

In practice, UMAP uses negative sampling: it doesn’t sum over all non-edges explicitly.
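To make the objective concrete, here is the cross-entropy evaluated on a couple of toy edge values. This is a sketch of the loss only, not of UMAP's actual optimizer (which works on sampled edges with stochastic gradient descent):

```python
import numpy as np


def umap_cross_entropy(mu: np.ndarray, nu: np.ndarray, eps: float = 1e-12) -> float:
    """Sum of binary cross-entropies between graph weights mu and layout similarities nu."""
    nu = np.clip(nu, eps, 1 - eps)  # avoid log(0)
    return float(-np.sum(mu * np.log(nu) + (1 - mu) * np.log(1 - nu)))


mu = np.array([0.9, 0.1])  # one strong edge, one weak edge
loss_match = umap_cross_entropy(mu, np.array([0.9, 0.1]))  # layout agrees with the graph
loss_flip = umap_cross_entropy(mu, np.array([0.1, 0.9]))   # layout contradicts the graph
# loss_match < loss_flip: agreement is rewarded, disagreement penalized
```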

# A small, simplified demo of the "fuzzy edge" idea (not a full UMAP implementation).
# The goal is to visualize what “fuzzy membership strengths” look like.


def _binary_search_sigma(distances: np.ndarray, rho: float, target: float, n_iters: int = 64) -> float:
    """Binary-search the local scale sigma so that the neighborhood "mass"
    (sum of membership weights) hits the target, typically log2(n_neighbors)."""
    lo, hi = 1e-4, 1000.0
    for _ in range(n_iters):
        mid = (lo + hi) / 2.0
        weights = np.exp(-(np.maximum(0.0, distances - rho)) / max(mid, 1e-12))
        s = float(weights.sum())
        if s > target:  # mass too large -> shrink sigma
            hi = mid
        else:
            lo = mid
    return hi


def simplified_fuzzy_graph(
    X: np.ndarray,
    n_neighbors: int = 15,
    metric: str = "euclidean",
) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1, metric=metric).fit(X)
    dists, idx = nn.kneighbors(X)
    dists = dists[:, 1:]
    idx = idx[:, 1:]

    rho = dists[:, 0].copy()
    target = float(np.log2(n_neighbors))
    sigmas = np.array([_binary_search_sigma(dists[i], float(rho[i]), target) for i in range(X.shape[0])])
    sigmas = np.maximum(sigmas, 1e-12)

    directed = np.exp(-(np.maximum(0.0, dists - rho[:, None])) / sigmas[:, None])

    # Dense adjacency for demo-sized X
    n = X.shape[0]
    A = np.zeros((n, n), dtype=np.float64)
    rows = np.repeat(np.arange(n), n_neighbors)
    cols = idx.ravel()
    A[rows, cols] = directed.ravel()

    sym = A + A.T - A * A.T
    return sym, idx, dists, rho, sigmas


# Run the demo on a small subset of the swiss roll
demo_idx = rng.choice(np.arange(X_swiss.shape[0]), size=240, replace=False)
X_demo = X_swiss[demo_idx]

W, knn_idx, knn_dists, rho, sigma = simplified_fuzzy_graph(X_demo, n_neighbors=15)

i = 0
weights_i = np.exp(-(np.maximum(0.0, knn_dists[i] - rho[i])) / sigma[i])

fig = go.Figure()
fig.add_trace(go.Scatter(x=knn_dists[i], y=weights_i, mode="markers+lines"))
fig.update_layout(
    title="Fuzzy edge strengths: distance → membership (one point)",
    xaxis_title="distance to neighbor",
    yaxis_title="membership strength",
)
fig.show()

fig = px.imshow(
    W,
    color_continuous_scale="Blues",
    aspect="auto",
    title="(Demo) Symmetrized fuzzy graph weights μ_ij",
)
fig.update_layout(height=440)
fig.show()

3) Hyperparameters explained (the ones that matter)#

n_neighbors#

Intuition: how many “friends” each point tries to keep close.

  • smaller values (e.g., 5–15) emphasize very local structure (tight clusters, fine details)

  • larger values (e.g., 30–100+) encourage more global coherence (how clusters relate to each other)

If you’re getting a “spidery” map with lots of tiny islands, try increasing n_neighbors.
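You can see the "tiny islands" effect directly in the neighbor graph, before any embedding happens: for well-separated groups, the kNN graph stays fragmented at small `k` and only merges into one component once `k` is large enough to force cross-group edges. A sketch on synthetic blobs (chosen so the effect is guaranteed):

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
# Two well-separated 2D blobs of 100 points each
X = np.r_[rng.normal(0.0, 0.5, (100, 2)), rng.normal(10.0, 0.5, (100, 2))]

comps = {}
for k in (3, 10, 150):
    g = kneighbors_graph(X, n_neighbors=k)
    comps[k], _ = connected_components(g, directed=False)
    print(f"k={k}: {comps[k]} connected component(s)")

# With k=150 > blob size, every point must reach into the other blob,
# so the graph collapses to a single component.
```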

min_dist#

Intuition: how tightly points are allowed to pack together in the embedding.

  • smaller values (e.g., 0.0–0.1) allow dense clumps (clear cluster separation)

  • larger values (e.g., 0.3–0.8) produce a more even spread (less clumping)

A good way to remember it: min_dist is mostly about visual density, not “truth.”
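Under the hood, min_dist does not appear in the loss directly: umap-learn fits the curve parameters \(a\) and \(b\) so that \(\nu_{ij}\) approximates a curve that is flat up to min_dist and decays exponentially afterwards. Here is a rough reimplementation of that fit, based on the description in the paper (the library exposes its own helper for this; treat the exact values below as approximate):

```python
import numpy as np
from scipy.optimize import curve_fit


def find_ab(min_dist: float, spread: float = 1.0) -> tuple[float, float]:
    """Fit nu(d) = 1 / (1 + a * d^(2b)) to a target curve that is ~1 up to
    min_dist and then decays exponentially with scale `spread`."""
    d = np.linspace(0.0, 3.0 * spread, 300)
    target = np.where(d <= min_dist, 1.0, np.exp(-(d - min_dist) / spread))
    nu = lambda dd, a, b: 1.0 / (1.0 + a * dd ** (2.0 * b))
    (a, b), _ = curve_fit(nu, d, target, p0=(1.0, 1.0))
    return float(a), float(b)


# Smaller min_dist -> steeper attraction curve (larger a) -> tighter clumps
a_tight, b_tight = find_ab(0.1)
a_loose, b_loose = find_ab(0.5)
```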

metric choice#

Intuition: defines what “nearby” means before UMAP ever starts.

  • euclidean: a solid default for continuous/tabular features (after scaling)

  • cosine: common for text embeddings / high-dimensional sparse-ish vectors

  • manhattan, correlation, etc.: sometimes better when features have specific meaning

If the metric doesn’t match your notion of similarity, no dimensionality reduction method can save it.
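A minimal illustration of why the metric matters: the same three vectors can rank differently under euclidean and cosine distance (toy numbers, chosen purely for illustration):

```python
import numpy as np

x = np.array([1.0, 0.0])
a = np.array([2.0, 0.0])  # same direction as x, twice the magnitude
b = np.array([0.9, 0.6])  # similar magnitude to x, different direction

euclidean = lambda u, v: float(np.linalg.norm(u - v))
cosine = lambda u, v: float(1 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Euclidean says b is the nearer neighbor of x;
# cosine says a is (their directions match exactly).
print(euclidean(x, a), euclidean(x, b))  # 1.0 vs ~0.61
print(cosine(x, a), cosine(x, b))        # 0.0 vs ~0.17
```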

# Dataset for hyperparameter/stability experiments: handwritten digits (64D)
digits = load_digits()
X_digits = digits.data
y_digits = digits.target

# Standardize + a bit of PCA to speed non-linear methods
X_digits = StandardScaler().fit_transform(X_digits)
X_digits = PCA(n_components=30, random_state=42).fit_transform(X_digits)

# Keep a moderately sized subset for interactive speed
n_digits = 900
sub_idx = rng.choice(np.arange(X_digits.shape[0]), size=n_digits, replace=False)
X_d = X_digits[sub_idx]
y_d = y_digits[sub_idx]


def fit_umap(
    X: np.ndarray,
    n_neighbors: int = 15,
    min_dist: float = 0.1,
    metric: str = "euclidean",
    random_state: int = 42,
) -> np.ndarray | None:
    if UMAP is None:
        return None
    model = UMAP(
        n_components=2,
        n_neighbors=int(n_neighbors),
        min_dist=float(min_dist),
        metric=metric,
        random_state=random_state,
    )
    return model.fit_transform(X)


def fit_tsne(
    X: np.ndarray,
    perplexity: float = 30.0,
    metric: str = "euclidean",
    random_state: int = 42,
) -> np.ndarray:
    return TSNE(
        n_components=2,
        perplexity=float(perplexity),
        metric=metric,
        init="pca",
        learning_rate="auto",
        max_iter=1000,
        random_state=random_state,
    ).fit_transform(X)


def fit_isomap(X: np.ndarray, n_neighbors: int = 10) -> np.ndarray:
    return Isomap(n_neighbors=int(n_neighbors), n_components=2).fit_transform(X)


def _digit_colorbar() -> dict:
    return dict(
        title="digit",
        tickmode="array",
        tickvals=list(range(10)),
        ticktext=[str(i) for i in range(10)],
    )


def scatter_digits(emb: np.ndarray, title: str, show_scale: bool = False) -> go.Scatter:
    return go.Scatter(
        x=emb[:, 0],
        y=emb[:, 1],
        mode="markers",
        marker=dict(
            size=4,
            opacity=0.85,
            color=y_d,
            colorscale="Turbo",
            showscale=show_scale,
            colorbar=_digit_colorbar() if show_scale else None,
            line=dict(color="white", width=0.5),
        ),
        showlegend=False,
        name=title,
        hovertemplate="x=%{x:.2f}<br>y=%{y:.2f}<br>digit=%{marker.color}<extra></extra>",
    )

4) Plotly experiments#

We’ll focus on three visual questions:

  1. How do n_neighbors and min_dist change the map?

  2. How stable are UMAP and t-SNE across random seeds?

  3. What do we mean by “local vs global” structure preservation?

# 4.1 Effect of hyperparameters: n_neighbors × min_dist (digits)

if UMAP is None:
    print("Skipping UMAP hyperparameter sweep (install umap-learn to run this section).")
else:
    n_neighbors_grid = [5, 15, 50]
    min_dist_grid = [0.0, 0.1, 0.5]

    titles = [
        f"k={k}, min_dist={md}" for md in min_dist_grid for k in n_neighbors_grid
    ]
    fig = make_subplots(
        rows=len(min_dist_grid),
        cols=len(n_neighbors_grid),
        subplot_titles=titles,
        horizontal_spacing=0.03,
        vertical_spacing=0.06,
    )

    for r, md in enumerate(min_dist_grid, start=1):
        for c, k in enumerate(n_neighbors_grid, start=1):
            emb = fit_umap(X_d, n_neighbors=k, min_dist=md, metric="euclidean", random_state=42)
            assert emb is not None
            fig.add_trace(scatter_digits(emb, title=f"k={k}, md={md}", show_scale=(r == 1 and c == 1)), row=r, col=c)
            fig.update_xaxes(showticklabels=False, row=r, col=c)
            fig.update_yaxes(showticklabels=False, row=r, col=c)

    fig.update_layout(
        title="UMAP hyperparameters: `n_neighbors` (k) vs `min_dist` (digits)",
        height=760,
    )
    fig.show()
Skipping UMAP hyperparameter sweep (install umap-learn to run this section).
# Metric choice demo: euclidean vs cosine (same k and min_dist)

if UMAP is None:
    print("Skipping UMAP metric comparison (install umap-learn to run this section).")
else:
    metrics = ["euclidean", "cosine"]
    fig = make_subplots(rows=1, cols=2, subplot_titles=[f"metric={m}" for m in metrics])
    for col, m in enumerate(metrics, start=1):
        emb = fit_umap(X_d, n_neighbors=15, min_dist=0.1, metric=m, random_state=42)
        assert emb is not None
        fig.add_trace(scatter_digits(emb, title=m, show_scale=(col == 1)), row=1, col=col)
        fig.update_xaxes(showticklabels=False, row=1, col=col)
        fig.update_yaxes(showticklabels=False, row=1, col=col)

    fig.update_layout(title="UMAP: metric choice changes what “nearby” means", height=420)
    fig.show()
Skipping UMAP metric comparison (install umap-learn to run this section).

4.2 UMAP vs t-SNE stability (run-to-run)#

Both UMAP and t-SNE are stochastic optimizations. Two important takeaways:

  • always set random_state when you want reproducibility

  • don’t over-interpret global rotations/flips (a “good” map can be rotated and still be the same)

A simple rotation-invariant sanity check is: do nearest neighbors in the 2D map stay similar across runs?

def avg_knn_overlap(a: np.ndarray, b: np.ndarray, k: int = 10) -> float:
    # kneighbors() without a query matrix already excludes each point from its
    # own neighbor list, so fit with n_neighbors=k and keep all columns.
    na = NearestNeighbors(n_neighbors=k).fit(a).kneighbors(return_distance=False)
    nb = NearestNeighbors(n_neighbors=k).fit(b).kneighbors(return_distance=False)
    overlaps = [len(set(ra) & set(rb)) / k for ra, rb in zip(na, nb)]
    return float(np.mean(overlaps))


seeds = [0, 1, 2]

# Compute embeddings
umap_embeds = []
if UMAP is not None:
    umap_embeds = [fit_umap(X_d, n_neighbors=15, min_dist=0.1, metric="euclidean", random_state=s) for s in seeds]
    assert all(e is not None for e in umap_embeds)

tsne_embeds = [fit_tsne(X_d, perplexity=30.0, metric="euclidean", random_state=s) for s in seeds]

# Visual: small multiples
rows = 2 if UMAP is not None else 1
fig = make_subplots(
    rows=rows,
    cols=len(seeds),
    subplot_titles=[f"seed={s}" for s in seeds] * rows,
    vertical_spacing=0.08,
)

if UMAP is not None:
    for col, (s, emb) in enumerate(zip(seeds, umap_embeds), start=1):
        fig.add_trace(scatter_digits(emb, title=f"UMAP seed={s}", show_scale=(col == 1)), row=1, col=col)
        fig.update_xaxes(showticklabels=False, row=1, col=col)
        fig.update_yaxes(showticklabels=False, row=1, col=col)

tsne_row = 2 if UMAP is not None else 1
for col, (s, emb) in enumerate(zip(seeds, tsne_embeds), start=1):
    fig.add_trace(scatter_digits(emb, title=f"t-SNE seed={s}", show_scale=(UMAP is None and col == 1)), row=tsne_row, col=col)
    fig.update_xaxes(showticklabels=False, row=tsne_row, col=col)
    fig.update_yaxes(showticklabels=False, row=tsne_row, col=col)

fig.update_layout(title="Run-to-run variability (digits): UMAP vs t-SNE", height=520 if rows == 2 else 300)
fig.show()

# Quantify: average overlap of kNN neighborhoods in the embedding
records = []
k = 10

if UMAP is not None:
    base = umap_embeds[0]
    for s, emb in zip(seeds, umap_embeds):
        records.append({"method": "UMAP", "seed": s, "avg_knn_overlap_vs_seed0": avg_knn_overlap(base, emb, k=k)})

base = tsne_embeds[0]
for s, emb in zip(seeds, tsne_embeds):
    records.append({"method": "t-SNE", "seed": s, "avg_knn_overlap_vs_seed0": avg_knn_overlap(base, emb, k=k)})

df_stability = pd.DataFrame(records)
fig = px.bar(
    df_stability,
    x="seed",
    y="avg_knn_overlap_vs_seed0",
    color="method",
    barmode="group",
    title=f"Stability metric: avg kNN overlap in embedding (k={k}; higher = more stable)",
)
fig.update_yaxes(range=[0, 1.0], title="avg overlap")
fig.show()

4.3 Local vs global structure preservation#

A useful mental split:

  • Local structure: do you keep the same nearest neighbors?

  • Global structure: do long-range relationships (“how far apart”) make sense?

No 2D map can perfectly preserve all distances, so methods pick tradeoffs.

We’ll compare on the swiss roll:

  • Isomap: tries to preserve geodesic (on-manifold) distances → strong global unfolding when the manifold is clean.

  • t-SNE: heavily emphasizes local neighborhoods → great cluster visualization, weaker global geometry.

  • UMAP: also neighborhood-based, but often shows more coherent global structure than t-SNE (still not a guarantee).

# Swiss roll embeddings + a simple local/global metric comparison

X_sr, t_sr = make_swiss_roll(n_samples=1100, noise=0.08, random_state=7)
t_sr = t_sr.astype(np.float64)
X_sr = StandardScaler().fit_transform(X_sr)

embeddings: dict[str, np.ndarray] = {}
embeddings["Isomap (k=10)"] = fit_isomap(X_sr, n_neighbors=10)
embeddings["t-SNE (perp=30)"] = fit_tsne(X_sr, perplexity=30.0, random_state=42)

if UMAP is not None:
    embeddings["UMAP (k=10)"] = fit_umap(X_sr, n_neighbors=10, min_dist=0.1, random_state=42)
    embeddings["UMAP (k=80)"] = fit_umap(X_sr, n_neighbors=80, min_dist=0.1, random_state=42)


def geodesic_spearman(X: np.ndarray, emb: np.ndarray, n_neighbors: int = 10) -> float:
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
    dists, idx = nn.kneighbors(X)
    dists = dists[:, 1:]
    idx = idx[:, 1:]

    n = X.shape[0]
    rows = np.repeat(np.arange(n), n_neighbors)
    cols = idx.ravel()
    data = dists.ravel()

    graph = csr_matrix((data, (rows, cols)), shape=(n, n))
    graph = graph.maximum(graph.T)
    geo = shortest_path(graph, directed=False)

    iu = np.triu_indices(n, k=1)
    geo_vec = geo[iu]
    emb_vec = pdist(emb)
    mask = np.isfinite(geo_vec)

    corr = spearmanr(geo_vec[mask], emb_vec[mask]).correlation
    return float(corr)


# Visual comparison
names = list(embeddings.keys())
cols = 2
rows = (len(names) + cols - 1) // cols

fig = make_subplots(rows=rows, cols=cols, subplot_titles=names, vertical_spacing=0.10)

for i, name in enumerate(names):
    emb = embeddings[name]
    if emb is None:
        continue
    r, c = i // cols + 1, i % cols + 1
    fig.add_trace(
        go.Scatter(
            x=emb[:, 0],
            y=emb[:, 1],
            mode="markers",
            marker=dict(
                size=3,
                opacity=0.85,
                color=t_sr,
                colorscale="Viridis",
                showscale=(i == 0),
                colorbar=dict(title="t") if i == 0 else None,
            ),
            showlegend=False,
        ),
        row=r,
        col=c,
    )
    fig.update_xaxes(showticklabels=False, row=r, col=c)
    fig.update_yaxes(showticklabels=False, row=r, col=c)

fig.update_layout(title="Swiss roll embeddings (color = intrinsic roll position t)", height=600)
fig.show()

# Metrics (computed on a subset to keep it interactive)
n_metric = 420
metric_idx = rng.choice(np.arange(X_sr.shape[0]), size=n_metric, replace=False)

X_m = X_sr[metric_idx]
metrics = []
for name, emb in embeddings.items():
    if emb is None:
        continue
    emb_m = emb[metric_idx]
    metrics.append(
        {
            "method": name,
            "trustworthiness@10": trustworthiness(X_m, emb_m, n_neighbors=10),
            "geodesic_spearman": geodesic_spearman(X_m, emb_m, n_neighbors=10),
        }
    )

df_metrics = pd.DataFrame(metrics)

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Local: trustworthiness@10 (higher = better)", "Global: geodesic Spearman (higher = better)"),
)
fig.add_trace(go.Bar(x=df_metrics["method"], y=df_metrics["trustworthiness@10"], name="trustworthiness"), row=1, col=1)
fig.add_trace(go.Bar(x=df_metrics["method"], y=df_metrics["geodesic_spearman"], name="geodesic Spearman"), row=1, col=2)
fig.update_xaxes(tickangle=-25)
fig.update_yaxes(range=[0.0, 1.0], row=1, col=1)
fig.update_yaxes(range=[-0.1, 1.0], row=1, col=2)
fig.update_layout(title="Local vs global preservation (swiss roll; subset for metrics)", height=420, showlegend=False)
fig.show()

5) Practical guidance#

When UMAP often beats t-SNE#

  • you need speed + scale (UMAP is typically faster on larger datasets)

  • you want an out-of-sample transform (UMAP(...).fit(X_train).transform(X_new))

  • you care about a bit more global coherence than typical t-SNE visualizations

  • you want flexible distance metrics (e.g., cosine) and reasonable results

When it doesn’t#

  • you want the most “separated-looking” clusters for visualization; t-SNE can be more dramatic

  • you’re in a regime where the neighborhood graph is unreliable (very noisy data, wrong metric, not scaled)

  • you need faithful global distances (2D can’t do that; consider PCA/MDS for distance-faithful goals)

A simple workflow#

  1. Start with metric="euclidean" (after scaling) and n_neighbors=15, min_dist=0.1.

  2. If the map is too “fragmented”: increase n_neighbors.

  3. If clusters look too smeared: decrease min_dist.

  4. If your notion of similarity isn’t Euclidean: change metric before tuning the other knobs.
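Step 1’s “after scaling” matters more than it looks: with euclidean distance, one unscaled high-variance feature silently decides everyone’s neighbors, so UMAP’s input graph is wrong before any knob-tuning starts. A quick demonstration on synthetic data (here `overlap` measures how much kNN sets agree before vs after scaling; the names are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature 2 has 100x the spread of feature 1
X = np.c_[rng.normal(0, 1, 300), rng.normal(0, 100, 300)]


def knn_sets(X: np.ndarray, k: int = 5) -> list[set]:
    # kneighbors() without a query matrix excludes each point from its own list
    idx = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(return_distance=False)
    return [set(row) for row in idx]


raw, scaled = knn_sets(X), knn_sets(StandardScaler().fit_transform(X))
overlap = float(np.mean([len(a & b) / 5 for a, b in zip(raw, scaled)]))
# overlap is well below 1: scaling changes who counts as a neighbor at all
```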

6) Comparison table: UMAP vs t-SNE vs Isomap#

The best method depends on what you mean by “structure.” Here’s a high-level comparison.

comparison = pd.DataFrame(
    [
        {
            "Method": "UMAP",
            "Main idea": "Match a fuzzy neighbor graph via cross-entropy",
            "Strength": "Fast, scalable, often decent global layout",
            "Weakness": "Can still distort global distances; sensitive to metric",
            "Key knobs": "n_neighbors, min_dist, metric",
            "Out-of-sample?": "Yes (transform)",
        },
        {
            "Method": "t-SNE",
            "Main idea": "Match neighbor probabilities (KL divergence)",
            "Strength": "Great local cluster visualization",
            "Weakness": "Global geometry unreliable; typically no native transform",
            "Key knobs": "perplexity, learning_rate, init",
            "Out-of-sample?": "Not standard",
        },
        {
            "Method": "Isomap",
            "Main idea": "Preserve geodesic distances via shortest paths",
            "Strength": "Can unfold clean manifolds (strong global)",
            "Weakness": "Sensitive to noise/holes; graph must be well-connected",
            "Key knobs": "n_neighbors",
            "Out-of-sample?": "Limited",
        },
    ]
)

fig = go.Figure(
    data=[
        go.Table(
            header=dict(
                values=list(comparison.columns),
                fill_color="#f2f2f2",
                align="left",
                font=dict(size=12),
            ),
            cells=dict(
                values=[comparison[c] for c in comparison.columns],
                align="left",
                font=dict(size=11),
                height=28,
            ),
        )
    ]
)
fig.update_layout(title="UMAP vs t-SNE vs Isomap (high-level)", height=420)
fig.show()

Exercises#

  1. Pick a dataset you know (tabular, images, embeddings). Try UMAP with:

    • n_neighbors ∈ {5, 15, 50}

    • min_dist ∈ {0.0, 0.1, 0.5}

    Write down: what changes visually? What do you think changed conceptually?

  2. Measure local structure: compute trustworthiness(X, Y, n_neighbors=10) for several settings.

  3. For swiss-roll-like data, compare Isomap vs UMAP by increasing noise and seeing when Isomap breaks first.

References#

  • McInnes, Healy, Melville (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.

  • van der Maaten, Hinton (2008). Visualizing Data using t-SNE.

  • Tenenbaum, de Silva, Langford (2000). A Global Geometric Framework for Nonlinear Dimensionality Reduction (Isomap).

  • scikit-learn docs: TSNE, Isomap, and trustworthiness